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ABSTRACT 

Motivation: Inferring the underlying regulatory pathways within a 
gene interaction network is a fundamental problem in Systems 
Biology to help understand the complex interactions and the 
regulation and flow of information within a system-of-interest. Given 
a weighted gene network and a gene in this network, the goal of an 
inference algorithm is to identify the potential regulatory pathways 
passing through this gene. 

Results: In a departure from previous approaches that largely 
rely on the random walk model, we propose a novel single- 
source ^-shortest paths based algorithm to address this inference 
problem. An important element of our approach is to explicitly 
account for and enhance the diversity of paths discovered by our 
algorithm. The intuition here is that diversity in paths can help enrich 
different functions and thereby better position one to understand the 
underlying system-of-interest. Results on the yeast gene network 
demonstrate the utility of the proposed approach over extant state- 
of-the-art inference algorithms. Beyond utility, our algorithm achieves 
a significant speedup over these baselines. 
Availability: All data and codes are freely available upon request. 
Contact: srini@cse.ohio-state.edu 

Supplementary information: Supplementary data are available at 
Bioinformatics online. 



1 INTRODUCTION 

With advances in high-throughput experimental technologies, we 
are now witnessing a revolutionary change in our ability to measure 
and store various forms of interaction data (e.g. protein-protein, 
protein-DNA/RNA, protein-metabolites and genetic interactions) 
for different organisms. The central objective in Systems Biology 
is to fuse and analyze the diverse molecular, cellular, tissue-level 
and higher level data sources to deduce how sub systems and whole 
organisms work from these network of interactions. 

A node in such a network is usually a gene or its corresponding 
protein, and an edge in a gene network represents a protein- 
protein interaction (PPI) or a transcription factor (TF)-DN A binding. 
A major challenge here is to identify the underlying regulatory 
pathways, each of which is a chain of interacting genes within a 
network. A regulatory pathway begins with a causal gene and ends 
at a target gene, and each gene in a pathway can either activate or 
deactivate some functions of its neighboring genes. The uncovering 
of potential pathways across genes helps biologists understand the 
cellular functions of each gene and protein. As noted earlier, a range 
of experimental methods such as two-hybrid analysis have been 
developed to uncover a large amount of interactions between genes 
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and proteins, and gene knock-out exper iments and some studies 
jBader et all I2004I : iMering et q/lEool have also computed the 
confidence level associated with an interaction. On the basis of 
discovery that genes with high centrality in a gene network usually 
have higher essentiality, lethality and pleiotropy (lHahn and Kern . 
2005 ; Jeong et al. ,_ 200 lh, several research works ( Bebek and Yang , 



2007l:lFroehlich et a/.Ll2007l:IScott et a/.Ll2006l : ISuthram gVa/.Ll2008 : 



Tu et fl/.U200d : IVaske et a/.Ll2009h have focused on inferring the 
regulatory pathways on a number of organisms such as fly, yeast and 
human. 

Given a weighted directed gene network and a specific gene, we 
seek to develop efficient algorithms for the following sub problems: 
(i) UnknownCausal: if the given gene is a target gene, infer possible 
causal genes and their pathways; (ii) UnknownTarget: if the given 
gene is a causal gene, infer possible target genes and their pathways. 
Since, given a target gene, some biological experiments such as 
expression quantitative trait loci (eQTL) analysis can locate an 
approximate location of a causal gene and thus can provide candidate 
causal genes, another sub problem is (iii) CandidateCausal: 
given a causal gene, infer the most likely causal gene among the 
candidates. 

Recently, approaches based on the random walk model have been 
suggested as a means to address this problem. A random walk 
typically starts from the given gene, walks through several nodes, 
and terminates according to some pre-defined parameters, such as 
walk length and edge weights. Generally, the number of visits to 
a node across multiple walks from a given gene gives a measure 
of influence or importance, which in turn provides a measure of 
how l ikely that n ode is involved in a pathway containing the given 
gene. iTu et al. ] d2QQ6h first proposed this model to address this 
inference problem with an additional requirement, rooted in domain 
know ledge, that the random wa lk could visit any nod e at most 
once. ISuthram et al ] d2QQ8h and iMissiuro et al\ J2009h proposed 
the electrical circuit model as an analogy of gene networks, in 
which each edge is a resistor and its conductance is proportional 
to the edge's weight. This circuit model can be solved according to 
Kirchhoff's and Ohm's laws, and the amount of current flowing 
through each node can be interpreted as the possibility of this 
node being involved in a pathway. It has been shown that this 
circuit m odel can be interpreted as the random walk model ( Doyle 
and Snell. ll984h . but it cann ot directly apply on d irecte d edges. 
The information flow model dStoimirovic' and Yul 2009h used a 
similar approach to topic sensitive PageRank (lHaveliwalaL 12003b . 



similar approacn to topic sensitive Fa geKank (UTayeliwalal I2UU jlh 
whereas the electrical ci rcuit model (jSuthram et al. . 20081) and 
PageRank-based method (IVoevodski et all l2009h can be seen as 
the special cases of the information model. As we discuss later, 
these approaches have some important limitations that we seek to 
overcome. 

Specifically, in a clear departure from past work, here we discuss 
a novel approach to solving this inference problem by adapting the 
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k shortest paths algorithm to directly point out potential pathways. 
In this article, we only discuss simple paths, in which a node is 
not allowed to appear more than one time. Intuitively, this agrees 
with the prevailing belief among domain scientists that a regulatory 
pathway is unlikely to repeatedly pass a node. The classic k shortest 
paths simple paths algorithm is due to lYerlll97lh . It executes 0(n) 
times Dijkstra algorithm to generate candidate paths for each of the 
k shortest paths, so its time complexity is 0(kn(m + «log«)), where 
n is the number of node s and m is the number of edges. Recently, 
iHershberger et al\ d2QQ7h classified the candidate paths into different 
clas ses and ad o pted a replacement-edge strategy for each class, 
and iGao et all (120 id) used the transformed graph with side cost 
to improve the efficiency of Yen's algorithm, but both of them have 
the same worst-case time complexity and space complexity as Yen's 
algorithm. All the algorithms discussed earlier address the single- 
pair problem, which is to find the k shortest paths between a pair 
of nodes in a weighted graph. However, in our inference problem, 
we need to calculate the k shortest paths from the given node to 
each other node to approximate the potentiality of each gene being 
involved in a pathway. For this reason, we introduce a single-source 
k shortest paths algorithm. 

Our algorithm is also based on Dijkstra algorithm, moreover 
adopting a data structure called pseudo tree to reduce the space 
storing all single- source k shortest paths. We show that an exact 
algorithm to solve this problem has high complexity and therefore 
we propose a heuristic approximation algorithm that appears to work 
well in practice — the single- source shortest paths algorithm. We 
note that the shortest paths may often have high overlap (low 
diversity, since may share a large number of edges) and one of our 
objectives is to explicitly identify diverse shortest paths. The domain 
intuition here is that, diverse paths between two genes might help one 
uncover non-redundant transduction pathways which are of interest 
to domain scientists. To accommodate this additional requirement, 
we propose a variant of our basic approach — a single-source k 
diverse shortest paths algorithm. 

Following best practices in silico validati on procedures outlined 
by previous re s earch on this problem (iBever et al 
ISuthram et ~all 120081: iTu et all 120061) . we demonstrate the 
efficacy and efficiency of the proposed approaches to inferring 
regulatory pathways on the Yeast gene network. Specifically, on 
the CandidateCausal sub-problem, our shortest paths algorithm 
achieves signifi cantly higher accuracy th a n extant state-of-the- 
art approaches (IStoimirovic" and YuL 120091 : ITu et all 120061) . and 
on the UnknownCausal and UnknownTarget sub problems, 
the importance values assigned by our shortest path variant that 
explicitly accounts for diversity in paths, not only achieves higher 
enrichment but also enriches different functions. In terms of 
scalability, when com p ared w ith the exact single-pair shortest 

1 >aths algorithm (bfenL Il97ll) « as well as some recent variants 
Malviva et all l201ll) , on the yeast gene network, our method 
achieves up to two orders of magnitude speedup. We should also note 
that while our approach relies on a simple heuristic approximation 
procedure, we empirically observe no difference in quality when 
compared with the exact algorithm. 



2 PROBLEM DEFINITION 

Let G = (V,E) denote a gene network, which is a weighted directed 
graph, where V is the set of nodes (genes or proteins), E is the 



set of directed edges (interactions) and n=\V\,m=\E\. Each edge 
eeE denoted by (g a ,gb)> §a,8b e ^^ represents the direction of 
interaction from g a to g/>. If an interaction is undirected, e.g. a PPI, 
the corresponding edge is bidirectional. w(e)e[0, 1] is the weight 
of an edge e, and the weight represents the confidence level of the 
interaction. We introduce how to generate the weight in Section I57TI 

Given a weighted graph as a gene network and a gene in 
this network, we seek to solve the following three sub problems: 
(i) UnknownCausal: infer possible causal genes if the given gene 
is a target gene; (ii) UnknownTarget: infer possible target genes if 
the given gene is a causal gene; and (iii) CandidateCausal: infer 
the true causal gene in a given set of candidate causal genes, if the 
given gene is a target gene. The target gene and the causal gene are 
denoted by g t and g c , respectively, and the set of candidate gene is 
denoted by C, where g c e C and C C V. 

The goal for CandidateCausal is to correctly select the true 
causal gene from C. An algorithm for solving UnknownCausal and 
UnknownTarget is to give an importance value to each node except 
the given node, where higher importance value indicates a higher 
chance that a gene is involved in a pathway containing the given 
gene, and the goal is to assign the potential causal genes or potential 
target genes and the nodes in the pathways higher importance values 
than other genes. The importance value of a gene g a is denoted by 
Imp(ga). 



3 REVIEW OF THE RANDOM WALK MODEL 

We begin by review i ng the information flow model introduced by 
IStoimirovic and Yul feoilh . since other algorithms based on the 
random walk model can be thought of as specific instantiations of the 
information flow model. The information flow model first constructs 
a nxn transition matrix P by normalizing the adjacent matrix A of 
the graph G, where Ay = w((gi,gj)) and Py = aAy/J2k A ik- a e (0, 1) 
is called 'damping factor'. When a random walk reaches a node gj, 
the walk would terminate at gi with probability 1 — a and go to 
another node gj with probability Py in the next step. A random walk 
starts from a source node and once the walk reaches a sink node, the 
walk immediately terminates. The walk simulates the information 
flow in gene networks: the flow starts from the causal gene (source 
node), follows the pathway and finally reaches the target gene (sink 
node). 

The random walk model has two serious problems. The first 
problem is that the normalization of edge weights is lossy. 
Figure[T]a) shows an example. Given the source node S, it is apparent 
that Imp{A) > Imp(C) = Imp(D) in the right graph, but in the left 
graph, it is not clear whether C and D is more likely to be involved in 
the pathway starting at the source node than A. However, both graphs 




Fig. 1. Two problems of the random walk model, (a) Normalization of edge 
weights is lossy, (b) The walks repeatedly go through the bidirectional edge 
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generate the same transition matrix, as shown in the middle graph, so 
the transition matrix cannot fairly reflect the original weights. One 
approach to solving this problem is to use different damping factors 
for different nodes. If the sum of the weights of all outgoing edges 
for a node is higher, a should be increased to decrease the chance 
a walk terminates at this node, but there is no intuitive approach to 
tuning it. Another approach is using the electrical circuit model, but 
it is difficult to set the voltage to fairly reflect the information flow. 

The second problem is that a walk can repeatedly go through 
the same bidirectional edge. As shown in Figure QIb), in which 
every edge has the same weight, the right graph is similar to the 
left graph but the edge (B,D) is bidirectional. Given the sink node 
K, it is intuitive that Imp(A) = Imp(C) in both graphs. However, 
the random walk model assigns Imp(A) a higher value than Imp(C) 
in the right graph since walks starting from C might go back to 
C before termination. This problem can be solved by forcing the 
random walks to visit only unvisited nodes, but this in turn leads to 
other problems [e.g. restrictive walks and scalability dSuthram et all 
l2008l) l. Another possible solution for this problem is to predict the 
direction of each bidirectional edge before random walks start, but 
both directions are sometimes used by different regulatory pathways. 
In this work, we seek to leverage the the single- source shortest 
simple paths algorithm to address these issues. 

4 SINGLE-SOURCE AC-SHORTEST PATHS 

4.1 Preliminaries and key intuition 

The random walks simulate the information flow in gene networks. 
The walks visit genes connected by higher weighted edges more 
frequently than genes connected by lower weighted edges, and 
distant genes from the given gene tend to be assigned lower 
importance values, agreeing with domain intuition. To capture 
similar intuitions when using the shortest path variants, we first 
convert the weight of an edge into a distance as follows: d(e) = 
— \og(w(e)) + c (We add a constant c, typically set to 1, because 
each edge should have a minimum distance, similar to a Laplacian 
correction.), where denotes the distance of an edge or a path. 
Then, the importance value of a gene g a is defined as Imp(g a ) = 
Y^li=\ ' where Pi , P^ , . . . , Pk are ^-shortest paths from the given 
gene to g a , With this transformation, it is obvious that genes having 
shorter distances to the given gene of interest will be assigned greater 
importance values. We also note that the electrical circuit model 
has an equivalent representation w.r.t. the ^-shortest paths algorithm 
it takes all paths into accounts) as pointed out by a recent study 
Missiurog7aZll2009h . 

A regulatory pathway, which is from a causal gene to a target gene, 
mig ht involve in multiple paths instead of a single path ( Suthram 
et al. j2008l : lTu et q/.Ll2006h . so the ^-shortest paths is able to directly 
point out these potential regulatory pathways. A regulatory pathway 
might not pass a gene more than once, so we only discuss simple 
paths, in which a loop is not allowed. Most importantly, since the 
^-shortest paths algorithm does not normalize edge weights and 
it can be directly applied on directed edges, it does not have the 
problems of the random walk model. 

The inference problem is now a single- source shortest paths 
problem, in which the starting node is the given causal gene when 
solving UnknownTarget or the given target gene when solving 
UnknownCausal and CandidateCausal. Since the original 



direction of edges regulates information from the causal gene to 
the target gene, the direction of all edges should be reversed first 
when solving UnknownCausal and CandidateCausal. To find 
paths from the target gene to the causal gene. Additionally, in 
CandidateCausal, we select the candidate causal gene with the 
highest importance value as our predicted causal gene. To the best of 
our knowledge, the state-of-the-art ^-shortest paths algorithms only 
address the single-pair problem and their w orst-case time com plexity 
on a directed graph is 0(kn(m + n\ogn)) Thus, 
it is time consuming to run a single-pair algorithm n — 1 times to 
solve the single- source problem, and thus a single- source algorithm 
should be applied to this inference problem. 

4.2 Single-source A>shortest paths algorithm 

To store all ^-shortest paths, we adopt a data stru cture called pseudo 
tree. The previous definition dGao et a/1 120101) of a pseudo tree is 
for single-pair problem, so it only stores shortest paths from the 
starting node to the destination node. We expand this definition to 
store all shortest paths from the starting node to each other node. 

Definition 1: Given all top ^-shortest paths from the starting node 
to each other node, a pseudo tree stores all paths in a tree structure. 
If k = l, the pseudo tree is equivalent to the shortest path tree; as 
k> 1, all 2nd to k-th shortest paths are iteratively merged into the 
pseudo tree by sharing the longest common prefix path. 

Definition 2: A tree-path is a path from the root to another node 
in a pseudo tree. 

An example is given in Figure[2] where all top three shortest paths 
from S are represented by the pseudo tree. Note that because every 
top shortest path should be a simple path, no nodes repeatedly 
appear in a tree path. Let A denotes a path from a node A to 
another node B through more than one edges, and A -> B denotes the 
path from A to B through exact one edge (A, B). P\ +P2 denotes a 
path formed by concatenating P\-P2- Most of prefix paths of a top 

shortest paths are also top shortest paths; however, some prefix 
paths are not. For example, the path S^A^C^D 'm Figure [2] is 
a prefix path of the third shortest path from S to E, S^A^C^ 
D^E, but it is not a top three shortest path from S to D. 

Definition 3: Path nodes and dummy nodes. A tree-path to a path 
node is a top-/: shortest path from the root to this path node, while 
a tree-path to a dummy node is not. 

Therefore, in the above example, we call the node D in the tree- 
path S^A^C^D^Esl dummy node and all other nodes except 
S path nodes. We will show that a complete pseudo tree might 
contain a huge number of dummy nodes, so we develop a heuristic 




Fig. 2. A graph and its corresponding pseudo tree. k = 3. The node with a 
dash circle is a dummy node 
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algorithm to construct a pseudo tree with only path nodes. First, we 
discuss what topological conditions can generate dummy nodes. 

Theorem 1: If a top ^-shortest path from the root S to a node A 
contains a dummy node P, let the sub-path from the dummy node 
to A be Pba, then f° r eacn t0 P ^-shortest path from S to P, denoted 
by PsB, either P$b contains a node in Pba besides B or there exists 
a top shortest path from S to A using P$b as a prefix path. 

Proof. Let a top shortest path from S to A containing a dummy 
node B be P and the subpath of P before and after the dummy node 
be P\ and P2, respectively (P=P\ +P2). We prove it by showing 
that if there exists a top ^-shortest path from S to P, denoted by 

, where does not contain any node in P2 except B and no top 
^-shortest path from S to A uses P^ as a prefix path, then such P 
does not exist. 

Since P^ does not contain any node in P2 except P, we can build 
a new path from S to A: P^ +P2; additionally, P^ is a top ^-shortest 
path from S to B and Pi is not, so d(P) > d(P^ +P2). Since no top 

shortest path from S to A uses P^ as a prefix path, P^ +P2 is 
not a top ^-shortest path from S to A. Therefore, P cannot be a top 

shortest path from S to A either, so such P does not exist. ■ 

For example, for the dummy node D in S^A^C^D^E in 
Figure [2] two top three shortest paths from S to D, S^B^D and 
S^B^E^D, contain B, and the other one, S -> A -> Z), is a prefix 
path of the 2 n j shortest path from S to P. 

Theorem 2: Given a graph G, the pseudo tree for the shortest 
paths contains at most 0(n 2 k) nodes. 

Proof. According to Definitions [T] and |3 a pseudo tree should 
contain exactly (n— l)k path nodes in order to indicate ^-shortest 
paths from the starting node to each other node. Note that the pseudo 
tree must contain the shortest path tree with the root S, and all n — 1 
path nodes in the shortest path tree do not require any dummy nodes. 
However, each of other (n— l)(k — 1) path nodes might requires at 
most n — 2 dummy nodes, so the upper bound of the number of 
dummy nodes is (n —\){n — 2)(k — 1), and the total number of nodes 
is 0(n 2 k). M 

Figure [3] shows an example with 0(n 2 k) dummy nodes. There 
are four types of nodes besides the starting node in this example, 
A/, Bi, C[ and P/y, and the number of nodes of each type is q, q, 
k, {q—\){k—\), respectively, q e 0(n) (recall that k is a very small 
number). We construct all edges in the following procedure: (i) for 
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2 < i < q, A/ has one outgoing edge (A; , A;_ \ ) and one incoming edge 
(B q ,Ai); (ii) for 1 <i<q — 1, Bi has one outgoing edge (P;,P; + i); 
(iii) every Q has two edges {A\ , Q) and (Q,Pi); (iv) every P/y has 
one incoming edge (A/,P/y) and one outgoing edge (Pij,B\)\ (v) The 
source node S has one outgoing edge to A q . The distances of all edges 
can be ignored except that d(A i ,P i j)+d(Pij,Bi) = \O i ,2<i<q,\ < 
j<k—l. The (top-1) shortest path from S to A/ is S^Ag=^A;, 
and k shortest paths from S to P; are A q -> A q _ \ =^A\ -> Cj -> 
B\ =^Bi,\ <j < k. Since all k shortest paths from S to B q contain all 
A nodes, we cannot form the 2nd to &-th shortest paths from S to 
A/ by using any of them as a prefix path. Thereby, the 2nd to &-th 
shortest paths from S to A/ are S^Aq^A^i -^P^i j^B\ =^ 
B q ^Ai, l<j<k. All B nodes in these paths are dummy nodes, 
so each A(, 1 <i<q— 1, requires (k—l)q dummy nodes. The total 
number of dummy nodes in this pseudo tree is therefore at least 
q(q-l)(k-l)eO(n 2 k). 

Since the number of dummy nodes in some cases is extremely 
large, an exact algorithm to construct the complete pseudo tree 
would be too time-consuming. As shown in Section 15.41 dummy 
nodes are actually very rare in real networks, so we adopt a heuristic 
algorithm that builds a pseudo tree T consisting of only path nodes. 
Our algorithm to build the pseudo tree is similar to the Dijkstra's 
algorithm: we add a node to the tree each time. Let N x denote a 
node in graph G and N' x denote one of the corresponding path nodes 
to a node N x € G. Note that a pseudo tree can have at most k path 
nodes corresponding to a single node in the graph. Starting from the 
root S, we iteratively expand the pseudo tree by adding a new path 
node N f b and an edge {N f a ,N f b ) to the pseudo tree, where N' a eT and 
(Na,Nb)eE; thereby, a new tree-path S^N' a ^N' b is formed. We 
require that: (i) the tree-path S^N' a does not contain N' b , since the 
new tree-path should be a simple path; (ii) T has at most k — \N' b ; 
and (iii) the new tree-path should have the smallest distance among 
all new possible paths through an expandable edge, i.e. 



b — ?LVgmind(S- 



>N f a ^N x ):N f a eT,(N a ,N x )eE. 



Fig. 3. An example with 0(n 2 k) dummy nodes, (a) The graph. Each bold 
arrow represents k—1 paths, (b) The pseudo tree. The red box contains total 
(k — l)q dummy nodes in the paths from S to A q -\ 



Thereby, the new tree-path S^N f a ^N b is a top-& shortest paths 
from S to Nfr containing only path nodes. 

The pseudo code is shown in AlgorithmQ] The min priority queue 
pq stores entries with the format <N x ,(N x ,N y ), dis >, where N x is 
a path node in T, (N x ,N y ) is an expandable edge, and dis is equal 
to d(S^N x )+d((N x ,N y )). pq always pops out an entry with the 
minimal dis among all entries, county) is the total number of 
nodes N x in T plus the number of entries containing an edge to N x 
in the priority queue. At the initial phase (lines 1-4), the starting 
node S becomes the root in T and all edges from S are inserted 
into pq. We then iteratively pick up an entry from pq and add the 
corresponding edge to T (lines 5-7). Each time we add a new edge 
(N' a ,N' b ) to r, we find all possible expandable paths S^N b ^N f c , 
which have less distances than the current &-th shortest path from S 
to N c (line 8). We then insert these paths into pq (line 11) or update 
the entry (line 13) depending on whether count(A^ c ) is already equal 
to k. The whole procedure is finished when there are no entries 
in pq. 

We can further constrain the length of a tree-path (the height of the 
pseudo tree) to be less than a constant h. Since a regulatory pathway 
usually involves in a limit number of genes, h should be set to the 
maximum number of genes in a regulatory pathway. Once the new 
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Algorithm 1 ^-shortest paths algorithm 



Input: A weighted graph G = (V,E), the starting node S, and k. 

Output: A pseudo tree T representing all top ^-shortest paths containing 
only path nodes, and arrays of distances Arr, where Arr(Ni)\j] storing 
the distance of the (/+ l)-th shortest path from S to fy. 

1 . For each node N( e V, assign an array Arr(Ni) that consists k values. All 
values are initialized to oo. 

2. count(Af; ) <*- 0 for each node Nf e V 

3. Put the root <S,0> in T. 

4. For all edges e = (S,N x )eE, put <S,e, d(e)> in a priority-queue, pq, 
and count(A^) <— 1. //pq is a min priority queue. 

5. while pq is not empty do 

6 . < N' a , (N a ,Nb), dis > <- pop-min(pg) 

7. concatenate < (N a ,Nb), dis > to N' a in T. //add a new path node N' b 
and an edge (N' a ,N' b ) to 7\ 

8. for all e = (N b ,N c )eE: dis+d(e) <Arr(N c )[k] and N c is not in this 
tree-path S =^N' a do 

9 . if count(iV c ) < k then 

10. count(A^ c ) 4- count(N c ) +1 

11. put <N' b ,e, dis +d(e)> in pq 

12. else 

13. update the corresponding entry of Arr(Nc)[k] in pq to < 
N f b ,e,dis+d(e)>. 

14. Arr(A^c)[^] ^dis+d(e) and sort Arr(N c ) //only need to move 
Arr(tf c )[*]. 

path node added in line 7 is in the hth level in the pseudo tree, the 
following code (lines 8-14) would not be executed. 

Theorem 3: The time complexity of Algorithm[T]is 0(nk\og(nk) + 
mk{h+k)). 

Proof. Count(A/^) in Algorithm [T] ensures that each node is 
inserted into the priority queue at most k times. This guarantees 
that the number of nodes in the final pseudo tree plus the number 
of entries in the priority queue is always less than or equal to nk, 
and the number of times of insertion is at most nk. Therefore, the 
time complexity of insertion into the priority queue is 0(nk\og(nk)), 
The time complexity of an update in the priority queue is 0(1) 
if the Fibonacci heap is adopted, and we need to update an entry 
at most 2mk times since each edge might cause an update if one 
of its incident vertices is poped out as N a in line 6. Moreover, 
we need 0(h) time to check whether N c is not in the tree-path 
S^N^ (line 8) and 0(k) time to update the sorted array (line 
14). The maximal number of times we need to execute above two 
operations is mk. Hence, the total time complexity of Algorithm [T] 
is 0(nk\og(nk)+mk(h+k)). ■ 

Note that if we execute n— 1 times Yen's algorithm, the time 
complexity is 0(nhk(m+n\ogn)), so our algorithms provides a 
faster solution. 

4.3 Diverse A>short paths 

The shortest paths are usually very similar to each other because 
replacing few edges in the shortest path can yield several short paths. 
Thus, if we only take into account distance, we might overemphasize 
on these similar paths. Studying the diverse path might identify 
other more interesting regulatory pathways, so we introduce a 
single- source diverse k- short paths algorithm. 

We define the diversity of a path as the number of edges in this 
path but not appearing in any already found paths divided by the 



total number of edges in the path. Formally, if we already find k < k 
diverse paths P\,P2,...Pk, the diversity of a new candidate path 
Pnew IS! 



div(Pnew) = 



\{e\eePnew,$Pr.eePi,l<i<K} 
\{e\eeP mw }\ 



Note that 0 < div(P) < 1 , and if there are no found paths, the diversity 
is always 1 . A path P is considered as a diverse path if and only if 
div(P)>X, where A is a user- specified threshold. The larger A is, 
the more diverse the found k paths are, but the total distance of the 
found k diverse paths would be larger. 

The most intuitive approach to find k diverse short paths is 
executing Algorithm[T]to find k' > ^-shortest paths from the starting 
node to each other node, and for each node, we try to select k diverse 
paths among the k' shortest paths. However, if the diverse paths are 
not enough, we need to repeat the procedure until k diverse paths 
are found for every node. Since the time complexity of Algorithm[T] 
is not linear in k, and there is no intuitive way to set an appropriate 
increasing amount of k, this procedure is time-consuming. 

Instead, we adopt a strategy that we remove some but not all 
edges appearing in the pseudo tree from the graph and then re-run 
Algorithm[T]with the new graph until all k diverse paths are found for 
every node. We tend to remove edges that appear in a large number 
of £- shortest paths to yield diverse short paths while destroying the 
topology of the original graph as little as possible. Moreover, the 
removed edges should not be in the first few levels of the pseudo 
tree since they are usually used as a prefix path for several short 
paths. For these two reasons, we simply count the number of times 
that an edge appears in the constructed pseudo tree and remove edges 
which frequently appear in the pseudo tree. Since each edge must 
appear at most k times in the pseudo tree, We set the probability 
of an edge e being removed to count(^)/^ if count(^) >k/2, where 
count(<?) is the number of times that e appears in the pseudo tree; 
otherwise, the probability is 0. 

The procedure of our single- source shortest diverse paths 
algorithm is described as follows: we first execute Algorithm Q] and 
then for each node, we examine the diversities of k shortest paths 
in the increasing order of their distances. If the diversity of a path 
exceeds the threshold k, the path is identified as a short diverse path. 
If the diverse paths are not enough (< k) for some nodes, we remove 
edges from the graph according to the probability described earlier. 
Then, we repeat the procedure until k diverse paths are found for 
every node or less than m/n edges are removed in the last iteration. 
Eventually, the importance value of each node is calculated as in 
Section PO 



5 EXPERIMENTS 

5.1 Dataset preprocessing 

We report results on the yeast gene network. The overview of the 
generation procedure of our yeast gene network and gold standards 
is shown in Figure [4] and follows best-practice in silico validation 
outlined by pri or research jBever et a/.Ll2006l : ISuthram et a/.LEooi 
iTu et q/.ll2006l) . The gold standards were extracted from the k nock- 
out compendium elucidated by Rosetta (iHughes et al l l2000h . The 
deleted genes in the experiments were assumed to be causal genes 
and the highly perturbed P- value < 10 -4 , genes were seen as target 
genes. The set of candidate causal genes for each gold standard, 
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Fig. 4. Overview of our inference framework 

generated by simulating eQTL region, is from (iTu et all [2006h . 
Using this procedure, 1728 gold standards were eventually extracted. 
Each gold standard consists of a causal gene, a target gene and 10 
candidate causal genes, and when solving the three sub problems, 
UnknownCausal, UnknownTarget and CandidateCausal, the 
input is the target gene, the causal gene and the target gene with 10 
candidate causal genes, respectively. The causal gene is also used to 
evaluate the accuracy in CandidateCausal. 

To construct a weighted gene network, the TF-DNA bindings and 
PPIs were first comb ined into an unweig hted network. TF-DNA 
binding data is from iBever et~al\ (120061) , and only the bindings 
whose likelihood scores are higher than 6.0 were used, resulting in 
a tot al of 4842 TF-DNA bindings. PPI data is from BioGRID ( Stark 
et al. J2006I) , and only physical interactions found by low- throughput 
experiments were used s ince these interactions have higher precision 
dPaccanaro et a/.lE005b . Following this process, we constructed an 
unweighted network with 4304 nodes and 18 413 edges. We then 
used the Pearson correlation coefficient of the gene expression level 
in Rosetta compendium to weight edges. The cond itions of gene 
expression we selected are based on lTu et al. 1 d2006h . 

Since a gene which is likely to be involved in a regulatory pathway 
involving the given gene tends to have higher expression correlation 
with the given gene, edges (g a ,gb) an d igb^a) are both assigned to 
a weight 

max((Pearson(ga,gt)+Pearson(g b ,gt))/2, e) 

in UnknownCausal and CandidateCausal, where g t is the target 
gene, and in UnknownTarget, the causal gene g c replaces g t . Note 
that we assigned different weights to an edge for different gold 
standards and for different sub problems. We do not use two incident 
genes' expression correlation, since a 'date hub', which is a node 
with high degree but the average gene expression correla tion with 
its neighbors is low lHan et all Eooi |jm~ et al 1 l2007h . is more 
likely to be involved in a pathway than other node. The minimum 
weight s is applied since the expression levels of the genes in a 
pathway do not all necessarily correlate with the given gene. We 
set s to 0.15, the maximal height of the pseudo tree h to 15 in 
the following experiments. We found that k > 10 does not generate 
results with significantly different quality from k=\0. Therefore, k 
is varied from 1 to 10 in this article to examine the trade-off between 
efficiency and quality of the result. 

The experiments were performed on a machine with Intel core 
i5 650 and 16GB of main memory. The information flow model 
dStoimirovic and YuL l201ll) was implemented in MATLAB, and 



Table 1. Accuracy and computation time of CandidateCausal 



Algorithm 


Accuracy (%) 


P-value 


Time(s) 


k = 10 shortest paths (s.p.) 


34.03 




0.1 


k = 10, X = 0.5 diverse s.p. 


34.08 


0.2265 


1.43 


k = 1 s.p. 


30.38 


1.09*e-6 


0.01 


IF model (a = 0.55) 


30.09 


1.52*e-4 


38.02 


IF model (a = 0.85) 


29.11 


4.76*e-6 


38.42 


Tu's 


22.34 


2.46*e-23 


0.1 



The diverse short paths algorithm with different k yields very close accuracy and thus 
only A. = 0.5 is shown here. 

Tu's algorithm ( ITu et all I2006L Yen's algorithm (sour c e cod e 
is from http : //co de . google . com/p / k- shor test-paths/) (lYenL Il97ll) « 
Y-STATISTICAL dMalviva et all 1201 lb and our algorithm were 
all implemented in Java. 

5.2 CandidateCausal 

The goal of CandidateCausal is to correctly pick the true causal 
gene from the ten candidate causal genes. The accuracy is calculated 
by dividing the number of gold standards in which the algorithm 
correctly chooses the true causal gene by the total number of 
gold standards. Table [T] shows the prediction accuracy, time to 
compute, in CandidateCausal and the P-values of the paired 
Binomial test, which examines the significance of the difference 
between the prediction results of our shortest paths algorithm (top 
line) and the other algorithms we evaluate. Y-STATISTICAL is too 
slow to process all gold standards in feasible time, as shown in 
Section 15.41 so we do not report it. The computation time is the 
average on each gold standard. The values of the damping factor a 
in the information model do not affect the accuracy significantly 
in the suggested range (0.55-0.99): the accuracies range from 
29-30%, so we just show two results with a = 0.55 and a = 0.85. 
Note that since each gold standard has 10 candidates, a purely 
random strategy would be expected to yield a predictive accuracy 
of around 10%. Both our approach and the state of-the-art baseline 
candidates perform much better but accuracy is limited because 
the gold standards extracted from the knock out compendium are 
noisy, and the gold standards themselves represent only the domain 
knowledge available to biologists currently which is known to be 
incomplete. In terms of results, our approach represents a significant 
improvement over the previous state-of-the-art as measured by the 
paired binomial test. 

We find that for this problem the diverse paths algorithm evaluated 
on different values of A cannot significantly enhance the prediction 
accuracy; nevertheless, we will show that the diverse paths can 
improve the enrichment levels and enrich different biological 
functions in Section 15.51 We should reiterate that in comparison 
to the random-walk-based approaches, not only do we see a clear 
benefit in terms of accuracy (and in some cases speed) but it is also 
much easier to directly identify the potential regulatory pathways 
by the k shortest paths algorithm. 

5.3 UnknownCausal and UnknownTarget 

For th e next set of experiments, we used SaddleSum ( Stoimirovic" 
and Yu. l201Qh to evaluate the enrichment levels of the GO terms 
(lAshburner et all l2000h for the results of UnknownCausal and 
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Fig. 5. Evaluation using GO ann : (a/b) The sorted e-values of the top GO term, (a) UnknownCausal using different k, (b) UnknownTarget using different k. 
The information model uses a = 0.99. (c) The sorted maximal differences of e-values in each gold standard in UnknownTarget between the shortest paths 
and the diverse short paths using different X. k = 5 
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Fig. 6. Evaluation using GO a \\. (a) UnknownCausal using different k, (b) UnknownTarget using different k and (c) Maximal differences. The description 
is the same as Figure \5\ 



UnknownTarget of a gene set V. In the result, each gene gj e V 
is assigned an importance value Imp(gi). SaddleSum is a state-of- 
the-art software to approximate the probability of randomly picking 
\Vf\ genes from V whose sum of the importance values exceeds 
^2 gi eV f I m P(gi) f° r eacn function/, where Vjcy denotes the set of 
all genes annotated with the function/. For each function, a e-value 
is reported by SaddleSum, which is the probability after applying 
the Bonferroni correction to the function's P- value. If the e-value 
of a function / generated from SaddleSum is less, it is more likely 
that the pathways involving the given gene regulates the property 
of /. To exclude high-level GO terms, we only examine the GO 
terms which less than 100 genes in the yeast gene network have. 
The resulting set, denoted by GO a n, contains 1764 GO terms. The 
subset of GO a u in which GO terms are annotated with the given 
gene is denoted by GO ann . Note that different gold standards have 
different GO ann . GO ann is used to verify our result; on the other 
hand, GO a u is used to point out potential GO terms which might 
associate with the given gene. 

The lowest e-value among GO ann is used to evaluate the 
function enrichment for each gold standard. These e-values of 
top gold standards are sorted and shown in Figure 0a) and 
(b) for UnknownCausal and UnknownTarget, respectively. The 
damping factor a in the information model is 0.99 here, because 
it generates the lowest e-values among all suggested values, 0.55, 



0.85 and 0.99 dStoimirovie and YuL 1201 lh . In UnknownCausal, 
because the given target genes are usually annotated with less GO 
terms, less gold standards have been found to be enriched than in 
UnknownTarget. In both UnknownCausal and UnknownTarget, 
an obvious trend is that the lowest e-values decrease when k 
increases. In UnknownCausal, the e-values obtained by the shortest 
path algorithm (k=l) are close to the information flow model 
and Tu's method, but when more paths are considered (&>3), 

shortest paths algorithm outperforms the information model and 
Tu's method. Similar results are observed in UnknownTarget. 
When k is increased from 5 to 10, the improvement is less because 
the 6th-10th paths are unimportant for regulatory pathways. 

In Figure 0a) and (b), GO a n is used instead. Again larger k can 
more significantly enrich GO terms. Two obvious differences are: 
(i) The e-values are much smaller, especially in UnknownCausal, 
and almost all gold standards can have an enriched GO term, because 
more GO terms are considered, (ii) The improvement from k = 5 
to & = 10 is apparent, possibly because potential GO terms are 
considered and thus more potential paths are useful to identify these 
pathways. 

The results illustrate that due to weight normalization and other 
problems stated in Section 13 the information flow model and Tu's 
method cannot achieve higher enrichment than our shortest paths 
algorithm, although both methods consider all possible paths. These 
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results are promising in that they agree with domain insights that a 
causal gene may regulate the target gene across three to five paths 
and that there is often an in-built redundancy within such pathways. 

5.4 Diversity, importance value and efficiency 

In this section, we report the comparison of diversity and efficiency. 
Since Tu's method and information flow model are not diverse path 
algorithms, we compare our single-source k diverse paths algorithm 
with a recently propo sed heuristic approach — Y-STATISTICAL 
dMalviva et all l201ll) , which is a single-pair k diverse paths 
algorithm. Y-STATISTICAL generates a path in each iteration and 
removes an edge in the last found path with a probability parameter 
Pr. The higher Pr is, the more diverse found k paths are. When 
Pr in Y-STATISTICAL is 0, Y-STATISTICAL is the same as Yen's 
algorithm. 

Let P\,P2,.-Pk be the k paths found by our original approach 
and \Pi \ is the number of edges in P(. We define k-paths diversity as 
the ratio of distinct edges in k paths, i.e. 

|PlUP 2 U...UP^| 



\pl\+\p2\+"'+\pk\ 

Figure 0a) presents the average importance value and the average 
&-paths diversity of the found paths from the starting nodes GGC1 to 
each other node. Here, we only show the result on one starting node, 
because as discussed later, Y-STATISTICAL (or Yen's algorithm) is 
too time consuming to generate all outputs for all starting nodes, 
but we empirically found that the comparisons on other starting 
nodes we have tested are very similar. Since for some nodes, the 
number of found diverse paths is smaller than k, we cannot always 
calculate the total distance of k paths (would go to infinity), and thus 
the importance value is used instead. Note that a higher importance 
value indicates that the found paths have lower distances and hence 
is preferred. We varied Pr in Y-STATISTICAL and k in our diverse 
paths algorithm from 0 to 1 with intervals 0.1 in Figure [7] 

As mentioned earlier, when the k paths are required to be more 
diverse, the total distance of the k paths increases, and therefore the 
importance value decreases. As can be seen in the leftmost point 
in Figure [3a), our heuristic shortest paths algorithm generates 
the same average importance value and the same average diversity 
as Yen's algorithm. We further compared the k paths found by 
our heuristic algorithm with the ^-shortest paths found by Yen's 
algorithm for each node and found that all k paths are equivalent. 
In other words, no dummy nodes exist in the pseudo tree of the 
yeast gene network with the starting node GGC1, so our heuristic 
algorithm can generate the same paths as the exact algorithm. 




-Y-STATISTICAL 
-our k diverse short paths alg. | 



-Y-STATISTICAL 
I — our k diverse short paths alg. 



average k-path diversity 



average k-path diversity 



Fig. 7. The comparison between our algorithm and Y-STATISTICAL on 
our yeast gene network. The starting node is GGC1. k = 5. (a) The average 
importance value vs. the average & -paths diversity and (b) Execution time 
vs. the average fc-paths diversity 



If the k paths are requested to be more diverse, our diverse paths 
algorithm can find shorter paths than Y-STATISTICAL when the 
£-paths diversity is lower than 0.9. This is because Y-STATISTICAL 
might remove several edges in the first few levels in the pseudo 
tree while these edges are usually used to compose short diverse 
paths. Our algorithm tends to preserve these paths so it can identify 
shorter paths when the diversity required is relatively small. Note 
that extremely large diversity values will lead to less meaningful 
results from a utility standpoint. 

Another advantage of our single- source algorithm is that it can 
find all shortest paths from the starting node to every other node, 
while it is very time consuming to execute n—l times a single-pair 
algorithm. As shown in Figure Ob), when k = 5, the execution time 
(including loading the graph) of our heuristic single- source k shortest 
paths algorithm in yeast gene network is less than 0.3 sec, whereas 
Yen's algorithm costs average about 800 sec. As Y-STATISTICAL 
removes more edges, the execution time of Y-STATISTICAL is 
decreased and as k in our k diverse paths algorithm increases, our 
algorithm needs more iterations to generate k paths for each node. 
However, our k diverse paths algorithm is still at least lOOx faster 
than Y-STATISTICAL using any value of k and Pr. 

5.5 Diversity and enriched functions 

In this section, we show that using different k, the importance 
values generated by the k diverse paths algorithm can identify 
different potential regulatory pathways with different functions. 
Since more gold standards are enriched in UnknownTarget than in 
UnknownCausal, we focused on UnknownTarget in this section. 
Figure He) reveals the maximal difference of the log e-value (The 
e-value is set to 1 if the term is not significantly enriched (> 0.01).) 
among all GO terms in GO an n in each gold standard between 
the diverse shortest paths and the shortest paths algorithm 
(dif = \og(Evaluex = o) — \og(Evaluex > o)). Supplementary Table 1 
presents the number of gold standards which have at least one 
significantly improved (dif>2) GO term in GO an n- Although as 
presented in Supplementary Figure 1, the diverse k short paths 
algorithm cannot decrease the lowest e-value, the diverse k short 
paths algorithm is able to enrich different GO terms with lower 
e-values. When k = 0.25, the maximal difference of e-value and 
the number of gold standards having an improved term are limited 
simply because the k diverse short paths are very similar to k shortest 
paths. However, if k > 0.75, the diverse paths can enrich different 
functions in a large number of gold standards as the maximal 
difference is larger. Generally, k = 0.15 or 1.0 is able to identify a 
large number of different potential enriched functions and pathways, 
whereas k < 0.25 produces similar paths to shortest paths and thus 
is relatively useless. We also show the maximal difference of the log 
e-value among all GO terms in GO a n in Figure Etc). The apparent 
difference in GO a n is that the maximal difference using k = 1.0 
is smaller than using k = 0.15. This is possibly because as all GO 
terms are considered, requiring completely diverse paths results in 
less interesting pathways. Therefore, k = 0.15 is a better setting to 
discover different functions which might annotate the given gene. 

Table [2] shows some enriched GO terms whose e-values are 
significantly different using different k, given starting node is RTS1. 
Examples with other starting nodes can be found in Supplementary 
Table 2. Not surprisingly, the e-values obtained using k = 0 are 
usually very closed to the e-values obtained using k = 0.25 since 



i56 



Regulatory pathways in a gene network 



Table 2. Case study: the e-values of some enriched GO terms found by k = 5 diverse short paths algorithm with different k, given the starting node RTS1 



GO number 


Function name 


k = 0 


A. = 0.25 


A = 0.5 


1 = 0.75 


A = 1.0 


Reference 


GO:0000075 a 


Cell cycle checkpoint 


4.55E-6 


6.43E-6 


1.75E-8 


1.11E-11 


5.27E-6 


(Chan and Amon. 2009) 


GO:0004721 a 


Phosphoprotein phosphatase activity 


1.29E-5 


4.38E-4 








(Wei etal. 2001) 


GO:0004722 a 


Protein serine/threonine phosphatase activity 


7.89E-7 


1.56E-6 


1.51E-5 






(Wei et oZ..200n 


GO:0005816 


Spindle pole body 






1.15E-4 


1.53E-6 






GO:0007062 a 


Sister chromatid cohesion 






9.33E-3 






(Riedel et al. 2006) 


GO:0007346 a 


regulation of mitotic cell cycle 


3.89E-5 


1.34E-5 


1.01E-8 


6.65E-13 


4.44E-6 


(Chan and Amon. 2009) 


GO:0031929 


TOR signaling cascade 


2.67E-8 


1.44E-8 


1.44E-6 




7.71E-3 





a These GO terms are in GO a nn, i-e. they annotate RTS1. 

E-values larger than 0.01 are denoted by representing insignificance. 




Fig. 8. The top 10 genes when A. = 0 and their main pathways, given the 
starting node RTS1. The weight of each edge is reported by its thickness. 
Darker nodes have higher importance values when k = 1.0. k = 5 

the found paths are very similar. In contrast, when k > 0.5, different 
sets of enriched GO terms can be identified. For example, given the 
starting node RTS1, protein serine/threonine phosphatase activity is 
most significantly enriched using k = 0. On the other hand, using 
k = 0.75 can most confidently identify the pathways regulating 
regulation of mitoti c cell cycle. As th e se GO terms have been f ound 
to annotate RTS1 (IChan and AmonL 120091 : Iwei et all l200lh . the 
result verifies that our algorithm is able to identify a number of 
regulatory pathways. Some GO terms not in GO an n are also shown 
in Table |2] and Supplementary Table 2 to point out potential GO 
terms which might be associated with the given gene. For example, 
RTS1 might associate with spindle pole body and TOR signaling 
cascade according to the e-values found by different k. 

Figured] visualizes the network consisting of the genes with the 
highest importance values using the starting node RTS 1 when k = 0. 
The top 10 genes when k = 0 are PPH21 > RRD2 > SGOl > TPD3 
> TAP24 > PPH22 > SIT4 > RRD1 > TOR2 > CDC55. The top 
10 ranked genes when k = 1.0 and their corresponding found paths 
are shown in Supplementary Table 3. It can be found that when k is 
increased from 0 to 1.0, the order of several genes is significantly 
changed. For example, CDC55 becomes the 5th, and ZDS2, which is 
ranked 11th, becomes 7th. The change of importance values results 
in the different enriched GO terms as discussed earlier. 



6 CONCLUSIONS 

We proposed a heuristic single- source shortest paths algorithm 
and a single-source k diverse short paths algorithm to address 
the pathway inference problem in gene networks. We pointed 



out that an exact single-source ^-shortest paths algorithm is 
practically infeasible, so a heuristic algorithm is adopted. A series 
of experiments was conducted to show that our algorithm can 
identify pathways with higher potentiality than current methods 
based on the random walk model, and requiring the paths to be 
more diverse can further uncover other potential regulated functions. 
Furthermore, our heuristic algorithm can achieve a huge speedup 
than the previous single-pair shortest paths algorithm while the 
found paths are equivalent in our yeast gene network. 

In the future, we would like to study the structure of the pseudo 
tree because a number of sub paths still repeatedly appear in the 
tree, and thus it is possible to develop a compressed data structure. 
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